Frequent Case Generation in Ad Hoc Retrieval of Three Indian Languages - Bengali, Gujarati and Marathi

نویسندگان

  • Jiaul H. Paik
  • Kimmo Kettunen
  • Dipasree Pal
  • Kalervo Järvelin
چکیده

This paper presents results of a generative method for the management of morphological variation of query keywords in Bengali, Gujarati and Marathi. The method is called Frequent Case Generation (FCG). It is based on the skewed distributions of word forms in natural languages and is suitable for languages that have either fair amount of morphological variation or are morphologically very rich. We participated in the ad hoc task at FIRE 2011 and applied the FCG method on monolingual Bengali, Gujarati and Marathi test collections. Our evaluation was carried out with title and description fields of test topics, and the Lemur search engine. We used plain unprocessed word index as the baseline, and n-gramming and stemming as competing methods. The evaluation results show 30%, 16% and 70% relative mean average precision improvements for Bengali, Gujarati and Marathi respectively when comparing the FCG method to plain words. The method shows competitive performance in comparison to n-gramming and stemming.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems

With the advent of word representations, word similarity tasks are becoming increasing popular as an evaluation metric for the quality of the representations. In this paper, we present manually annotated monolingual word similarity datasets of six Indian languages – Urdu, Telugu, Marathi, Punjabi, Tamil and Gujarati. These languages are most spoken Indian languages worldwide after Hindi and Ben...

متن کامل

Bengali and Hindi to English Cross-language Text Retrieval under Limited Resources

This paper describes our experiment on two cross-lingual and one monolingual English text retrievals at CLEF in the ad-hoc track. The cross-language task includes the retrieval of English documents in response to queries in two most widely spoken Indian languages, Hindi and Bengali. For our experiment, we had access to a HindiEnglish bilingual lexicon, ’Shabdanjali’, consisting of approx. 26K H...

متن کامل

Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval

The Transliterated Search track has been organized for the third year in FIRE-2015. The track had three subtasks. Subtask I was on language labeling of words in code-mixed text fragments; it was conducted for 8 Indian languages: Bangla, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, mixed with English. Subtask II was on ad-hoc retrieval of Hindi film lyrics, movie reviews and astr...

متن کامل

Bengali and Hindi to English CLIR Evaluation

Our participation in CLEF 2007 consisted of two Cross-lingual and one monolingual text retrieval in the Ad-hoc bilingual track. The cross-language task includes the retrieval of English documents in response to queries in two Indian languages, Hindi and Bengali. The Hindi and Bengali queries were first processed using a morphological analyzer (Bengali), a stemmer (Hindi) and a set of 200 Hindi ...

متن کامل

A Case Study in Decompounding for Bengali Information Retrieval

Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, or Finnish. No previous studies, however, exist on the effect of decomposition of compounds in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. Some unique characteristics of ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011